ABSTRACT


In this report it is shown that, in the context of a specific pattern
classification decision metric, the number of samples M needed to characterize
a cluster described by N features is:

$$M \geq (1 + \beta^{-1})(N + 2) \tag{i}$$

where $\beta$ represents an interval width. The distance metric

$$d_M^2(X) = (X - \hat{\mu}_X)^t \, S_X^{-1} \, (X - \hat{\mu}_X) \tag{ii}$$

is shown to have an F distribution, which leads to result (i). An additional
application of the distribution of (ii) is discussed in terms of a specific
type of pattern classifier.
I. INTRODUCTION

In choosing the number of samples that are required to estimate a correlation
matrix of a multivariate Gaussian process, a number of guidelines have been
suggested. With N equal to the dimensionality of the matrix, and M equal to the
number of samples, the following results have been obtained:

i)   $M \geq N$

ii)  $M > 2N$ (Reed et al.) [1]

iii) $M > 2N$ (Cover) [2]

iv)  $M > 5N$ (Foley) [3]

The first two have been studied in the context of estimation. The first
result states that in order for the estimate of the matrix given in Eq. (1) to
be nonsingular, the number of samples must be greater than or equal to the
dimensionality of the matrix. The estimation equation for the correlation
matrix formed from the vectors $\{X_1, X_2, \ldots, X_M\}$ is given by

$$\hat{R} = \frac{1}{M} \sum_{i=1}^{M} X_i X_i^t \tag{1}$$
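The nonsingularity condition in result (i) can be checked numerically. The following is a minimal sketch, assuming small hand-picked sample vectors (they are illustrative and not taken from the report): with N = 3 features and only M = 2 samples the matrix of Eq. (1) has rank at most 2, so its determinant vanishes, while a third linearly independent sample makes it invertible.

```python
def sample_correlation(vectors):
    """Return R = (1/M) * sum_i X_i X_i^t as a list of rows (Eq. 1)."""
    m = len(vectors)
    n = len(vectors[0])
    r = [[0.0] * n for _ in range(n)]
    for x in vectors:
        for a in range(n):
            for b in range(n):
                r[a][b] += x[a] * x[b] / m
    return r


def det3(a):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


# Hypothetical data: N = 3 features, M = 2 samples, so rank(R) <= 2.
too_few = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
# Adding a third, linearly independent sample gives M = N = 3.
enough = too_few + [[1.0, 1.0, 0.0]]

det_too_few = det3(sample_correlation(too_few))   # singular estimate
det_enough = det3(sample_correlation(enough))     # nonsingular estimate
```

Note that M >= N is necessary but not sufficient in general: the samples must also be linearly independent, which holds with probability one for a nondegenerate Gaussian process.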


The second result was derived by considering the expected signal-to-noise ratio
(S/N) that could be obtained in an adaptive antenna system with an infinite
number of samples, $(S/N)_\infty$, and then finding the number of samples M
necessary to estimate the antenna weights such that the expected
$(S/N)_M = \tfrac{1}{2}(S/N)_\infty$.

The third and fourth results were derived in the context of detection
problems, specifically pattern recognition ones, and presented the point
of view that a sufficient number of samples should be used so that samples
of a single multivariate random process would not appear to be samples from
two processes. Alternatively, where two processes were present and the
Bayes' error for classification could be calculated, this error could be
considered as the $M = \infty$ case. We then ask, with N dimensions, how many
samples are necessary such that the probability of error is close to the
Bayes' error. The results of these studies led to (iii) and (iv).

II. A NEW APPROACH BASED ON CONFIDENCE INTERVALS

Let us assume that our sample vectors $\{X_m \mid m = 1, \ldots, M\}$ come from a
multivariate Gaussian random process with mean $\mu_X$ and covariance matrix
$\Sigma_X$. Consider the quadratic form given by

$$d^2(X) = (X - \mu_X)^t \, \Sigma_X^{-1} \, (X - \mu_X) \tag{2}$$

The term d may be considered as the distance from the sample X to the mean
$\mu_X$ measured in standard deviation units. This distance may then be
interpreted either in the context of a signal-to-noise ratio calculation or in
the context of a minimum distance classification rule. We also observe that if
$\mu_X$ and $\Sigma_X$ are being estimated (by $\hat{\mu}_X$ and $S_X$) from a
sample population by the consistent* equations:

$$\hat{\mu}_X = \frac{1}{M} \sum_{m=1}^{M} X_m \tag{3a}$$

$$S_X = \frac{1}{M-1} \sum_{m=1}^{M} (X_m - \hat{\mu}_X)(X_m - \hat{\mu}_X)^t \tag{3b}$$

then d corresponds to the infinite sample case. Since X is a random vector,
$d^2(X)$ is a random variable and we would like to determine its distribution.

* A consistent estimator $\hat{\theta}$ of a parameter $\theta$ is one for which
$\lim_{M \to \infty} \Pr[\,|\hat{\theta} - \theta| > \epsilon\,] = 0$.
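The estimators of Eqs. (3a) and (3b) and the quadratic form of Eq. (2) can be sketched in a few lines. This is an illustrative implementation only: the function names and the four two-dimensional sample points are assumptions made for the example, and the linear solve uses plain Gaussian elimination rather than any particular library routine.

```python
def sample_mean(samples):
    """Eq. (3a): component-wise mean of the M sample vectors."""
    m = len(samples)
    return [sum(col) / m for col in zip(*samples)]


def sample_covariance(samples, mean):
    """Eq. (3b): S = 1/(M-1) * sum_m (X_m - mean)(X_m - mean)^t."""
    m, n = len(samples), len(mean)
    s = [[0.0] * n for _ in range(n)]
    for x in samples:
        dev = [xi - mi for xi, mi in zip(x, mean)]
        for a in range(n):
            for b in range(n):
                s[a][b] += dev[a] * dev[b] / (m - 1)
    return s


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance estimate is singular; need more samples")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def distance_squared(x, mean, cov):
    """Quadratic form of Eq. (2) evaluated with the supplied mean and covariance."""
    dev = [xi - mi for xi, mi in zip(x, mean)]
    z = solve(cov, dev)          # z = cov^{-1} (x - mean)
    return sum(d * zi for d, zi in zip(dev, z))


# Hypothetical samples at the corners of a square (M = 4, N = 2).
samples = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu_hat = sample_mean(samples)                 # [1.0, 1.0]
s_hat = sample_covariance(samples, mu_hat)    # diag(4/3, 4/3)
d2 = distance_squared([3.0, 1.0], mu_hat, s_hat)
```

Here the point (3, 1) lies two units from the estimated mean along the first axis, and the estimated variance along that axis is 4/3, so the squared distance is 4 / (4/3) = 3.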

The Distribution of $d^2(X)$

Since X is a Gaussian random vector, distributed as $N(X \mid \mu_X, \Sigma_X)$, any
linear function of X is a Gaussian random vector. Let

$$y = W(X - \mu_X) \tag{4}$$

then:

$$\mu_y = E[W(X - \mu_X)] = W \cdot E[(X - \mu_X)] = 0 \tag{5a}$$

$$\Sigma_y = E[y y^t] = W \Sigma_X W^t \tag{5b}$$

$$d^2(y) = (W^{-1} y)^t \, \Sigma_X^{-1} \, (W^{-1} y)$$

$$d^2(y) = y^t \left[ (W^{-1})^t \, \Sigma_X^{-1} \, W^{-1} \right] y \tag{5c}$$

where W is an (N x N) invertible matrix. Specifically we will define W as
follows: Let $\phi_i$ be an eigenvector of $\Sigma_X$ such that
$\Sigma_X \phi_i = \lambda_i \phi_i$. We define $\Phi$ to be the matrix whose rows
are $\{\phi_i^t \mid i = 1, \ldots, N\}$ and $\Lambda$ to be the diagonal matrix
$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$. Then

$$\Phi \Sigma_X = \Lambda \Phi \tag{6}$$

and since the eigenvectors are orthonormal:

$$\Phi \Phi^t = I \tag{7a}$$

this implies

$$\Phi^{-1} = \Phi^t \tag{7b}$$

We are now prepared to define W as:

$$W = \Lambda^{-1/2} \Phi \tag{8}$$

where $\Lambda^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_N})$. Substituting
Eq. (8) into Eqs. (5b, 5c) we arrive at:

$$\Sigma_y = \Lambda^{-1/2} \, \Phi \, \Sigma_X \, \Phi^t \, \Lambda^{-1/2} = I \tag{9a}$$

$$d^2(y) = y^t y = \sum_{n=1}^{N} y_n^2 \tag{9b}$$

where $y_i$ is the i-th component of the y vector and is distributed according to
$N(y_i \mid \mu_{y_i} = 0, \sigma_{y_i}^2 = 1)$. Then $d^2(y)$ has a $\chi^2$ distribution
with N degrees-of-freedom. But $d^2(y)$ (from Eq. 5c) equals $d^2(X)$, so that
$d^2(X)$ has a $\chi^2$ distribution with N degrees-of-freedom. From the properties
of the $\chi^2$ distribution we know that

$$E[d^2(X)] = N \tag{10a}$$

$$\mathrm{Var}[d^2(X)] = 2N \tag{10b}$$
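The whitening argument of Eqs. (4) through (9b) can be verified on a small case. The sketch below assumes a hypothetical 2 x 2 covariance matrix, mean and sample point, chosen only because the eigen-decomposition is known in closed form; it checks that W as defined in Eq. (8) maps the covariance to the identity and that the distance computed from y equals the distance computed from X.

```python
import math

# Hypothetical covariance with a closed-form eigen-decomposition:
# eigenvalues 3 and 1, eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
sigma = [[2.0, 1.0], [1.0, 2.0]]
eigenvalues = [3.0, 1.0]
r = 1.0 / math.sqrt(2.0)
phi = [[r, r], [r, -r]]          # rows are the orthonormal eigenvectors


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


# Eq. (8): W = Lambda^{-1/2} Phi scales each eigenvector row by 1/sqrt(lambda_i).
w = [[phi[i][j] / math.sqrt(eigenvalues[i]) for j in range(2)] for i in range(2)]

# Eq. (9a): the covariance of y = W (X - mu) should be the identity.
sigma_y = matmul(matmul(w, sigma), transpose(w))

# Eq. (9b): d^2 computed from X should equal y^t y.
mu = [1.0, -1.0]                 # hypothetical mean
x = [2.0, 1.0]                   # hypothetical sample
dev = [x[0] - mu[0], x[1] - mu[1]]
det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
sigma_inv = [[sigma[1][1] / det, -sigma[0][1] / det],
             [-sigma[1][0] / det, sigma[0][0] / det]]
d2_x = sum(dev[i] * sigma_inv[i][j] * dev[j] for i in range(2) for j in range(2))
y = [sum(w[i][j] * dev[j] for j in range(2)) for i in range(2)]
d2_y = sum(v * v for v in y)
```

For this choice of numbers both distances equal 2, which is the sum of the two squared whitened components 3/2 and 1/2.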


The Distribution of $d_M^2(X)$

When only a finite number of data samples are available and the
parameters $\mu_X$ and $\Sigma_X$ must be estimated, Eq. (2) takes the form:

$$d_M^2(X) = (X - \hat{\mu}_X)^t \, S_X^{-1} \, (X - \hat{\mu}_X) \tag{11}$$

To develop the distribution of $d_M^2(X)$ we will need the $T^2$ statistic. The
definition and distribution of the $T^2$ variable are rephrased from Anderson [5]
in the following:

Theorem: Define $T^2 = p^t S^{-1} p$ where p is distributed according to
$N(p \mid 0, \Sigma)$ and $(M-1)S$ is independently distributed as

$$(M-1)S = \sum_{m=1}^{M-1} Z_m Z_m^t$$

where the $Z_m$ are independently distributed as $N(Z_m \mid 0, \Sigma)$. Then

$$F = T^2 \, \frac{M-N}{(M-1)N}$$

has a central F distribution with (N, M-N) degrees-of-freedom.
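The theorem can be probed by simulation. The sketch below is a Monte Carlo check under stated assumptions, not part of the original derivation: it takes N = 2, M = 12 and $\Sigma = I$ (all chosen for illustration), draws p and the $Z_m$ exactly as in the theorem, and compares the sample mean of the scaled statistic with $k_2/(k_2 - 2)$, the standard mean of an F variable with $k_2 = M - N$ denominator degrees-of-freedom. The seed and trial count are arbitrary.

```python
import random


def t_squared_f_statistic(rng, m):
    """Draw one F = T^2 (M-N)/((M-1)N) for N = 2 and Sigma = I (assumed case)."""
    n = 2
    # p ~ N(0, I)
    p = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    # (M-1) S = sum of M-1 outer products Z_m Z_m^t with Z_m ~ N(0, I)
    a = b = c = 0.0              # symmetric 2 x 2 accumulator [[a, b], [b, c]]
    for _ in range(m - 1):
        z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        a += z0 * z0
        b += z0 * z1
        c += z1 * z1
    a, b, c = a / (m - 1), b / (m - 1), c / (m - 1)   # entries of S
    det = a * c - b * b
    # T^2 = p^t S^{-1} p using the closed-form 2 x 2 inverse
    t2 = (c * p[0] * p[0] - 2.0 * b * p[0] * p[1] + a * p[1] * p[1]) / det
    return t2 * (m - n) / ((m - 1) * n)


rng = random.Random(0)           # fixed seed so the run is repeatable
M, N, TRIALS = 12, 2, 20000
draws = [t_squared_f_statistic(rng, M) for _ in range(TRIALS)]
sample_mean_f = sum(draws) / TRIALS
expected_mean_f = (M - N) / (M - N - 2)      # k2 / (k2 - 2) with k2 = M - N
```

With these settings the expected mean is 1.25 and the standard error of the sample mean is roughly 0.01, so agreement to within a few hundredths is what one would hope to see. A check of the mean alone does not confirm the full distribution; comparing empirical quantiles against F-tables would be a stronger test.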

We treat the random vector X as a sample independent of the set used to
estimate $\mu_X$ and $\Sigma_X$.* Then, under the null hypothesis that X is
distributed according to $N(X \mid \mu_X, \Sigma_X)$, $\hat{\mu}_X$ is distributed
according to $N(\hat{\mu}_X \mid \mu_X, \Sigma_X / M)$, and $S_X$ is as given in
Eq. (3b), we set

$$T^2 = \frac{M}{M+1} \, (X - \hat{\mu}_X)^t \, S_X^{-1} \, (X - \hat{\mu}_X) \tag{12a}$$

$$T^2 = \frac{M}{M+1} \, d_M^2(X) \tag{12b}$$

With $p = \sqrt{M/(M+1)} \, (X - \hat{\mu}_X)$ we have that the random variable
$T^2$, scaled as in the theorem, has an F-distribution with (N, M-N)
degrees-of-freedom and a critical region specified by

$$T^2 > \frac{(M-1)N}{M-N} \, F_{N,M-N}(\alpha) \tag{13}$$

with a significance level $\alpha$.

* Fukunaga and Kessell [6] have considered the case where X is part of the same
sample set used to estimate $\mu_X$ and $\Sigma_X$. Using a scalar statistic, they
have derived a test for the multivariate normality of the data without
prior knowledge of the mean and covariance parameters.

We may now determine $E[d_M^2(X)]$ and $\mathrm{Var}[d_M^2(X)]$ as follows:

$$E[d_M^2(X)] = \frac{M+1}{M} \, E[T^2] = \left(\frac{M+1}{M}\right) \left(\frac{(M-1)N}{M-N}\right) E[F_{N,M-N}] \tag{14a}$$

$$\mathrm{Var}[d_M^2(X)] = \left(\frac{M+1}{M}\right)^2 \mathrm{Var}[T^2] = \left(\frac{M+1}{M}\right)^2 \left(\frac{(M-1)N}{M-N}\right)^2 \mathrm{Var}[F_{N,M-N}] \tag{14b}$$

It can be shown that the expected value of the F-distribution with $(k_1, k_2)$
degrees-of-freedom is [4]:

$$E[F_{k_1,k_2}] = \frac{k_2}{k_2 - 2}, \qquad k_2 > 2 \tag{15a}$$

The variance is:

$$\mathrm{Var}[F_{k_1,k_2}] = \frac{2 k_2^2 (k_1 + k_2 - 2)}{k_1 (k_2 - 2)^2 (k_2 - 4)}, \qquad k_2 > 4 \tag{15b}$$

Assuming that the stronger of these conditions is true, i.e., $k_2 = M - N > 4$,
we have

$$E[d_M^2(X)] = \left(\frac{M+1}{M}\right) \left(\frac{(M-1)N}{M-N}\right) \left(\frac{M-N}{M-N-2}\right) = \left(\frac{M+1}{M}\right) \frac{N(M-1)}{M-N-2} \tag{16}$$

and

$$\mathrm{Var}[d_M^2(X)] = \left(\frac{M+1}{M}\right)^2 \left(\frac{M-1}{M-N-2}\right)^2 \frac{2N(M-2)}{M-N-4} \tag{17}$$
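Eqs. (16) and (17) are simple enough to evaluate directly. The following sketch (the function name is an assumption made for the example) computes both moments, reproduces the M = 100, N = 6 values that the report uses in Section III, and shows the approach to the $\chi^2$ values N and 2N for very large M.

```python
import math


def finite_sample_moments(m, n):
    """Mean and variance of d_M^2(X) from Eqs. (16) and (17); requires M - N > 4."""
    if m - n <= 4:
        raise ValueError("Eqs. (16)-(17) assume M - N > 4")
    scale = (m + 1) / m
    mean = scale * n * (m - 1) / (m - n - 2)
    var = (scale * (m - 1) / (m - n - 2)) ** 2 * 2 * n * (m - 2) / (m - n - 4)
    return mean, var


# The report's own example: M = 100 samples, N = 6 features.
mean_100, var_100 = finite_sample_moments(100, 6)
std_100 = math.sqrt(var_100)

# For very large M the moments approach the chi-square values N and 2N.
mean_big, var_big = finite_sample_moments(10**7, 6)
```

For M = 100 and N = 6 this gives a mean of about 6.52 and a standard deviation of about 3.93, compared with 6 and $\sqrt{12} \approx 3.46$ in the infinite sample case.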


Effects of Finite Sample Size

We now have two expressions for distance: one based on finite sample
size, $d_M^2(X)$, and one on infinite sample size, $d^2(X)$. As M increases we have
from Eqs. (16) and (17)

$$\lim_{M \to \infty} E[d_M^2(X)] = N = E[d^2(X)] \tag{18}$$

$$\lim_{M \to \infty} \mathrm{Var}[d_M^2(X)] = 2N = \mathrm{Var}[d^2(X)] \tag{19}$$

Let us now determine (in the Reed or Foley sense) the value of M required
so that $E[d_M^2(X)]$ is within $100\beta\%$ of its true value. Then

$$E[d_M^2(X)] = N \left(\frac{M+1}{M}\right) \left(\frac{M-1}{M-N-2}\right) = N(1 \pm \beta) \tag{20}$$

Since the left side of Eq. (20) approaches the limiting value from above this
may be written as

$$\left(\frac{M+1}{M}\right) \left(\frac{M-1}{M-N-2}\right) = 1 + \beta \tag{21}$$

Solving for M in terms of N and $\beta$ gives:

$$M = \frac{(1+\beta)(N+2) + \sqrt{(1+\beta)^2 (N+2)^2 - 4\beta}}{2\beta} \tag{22}$$

For small values of $\beta$ an excellent approximation is:

$$M = (1 + \beta^{-1})(N + 2) \tag{23}$$
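The step from Eq. (21) to Eq. (22) is a quadratic in M: clearing denominators gives $\beta M^2 - (1+\beta)(N+2)M + 1 = 0$, and Eq. (22) is its larger root. The sketch below (function names and the choice N = 6, $\beta = 0.1$ are illustrative assumptions) evaluates the exact root, the approximation of Eq. (23), and substitutes the exact root back into Eq. (21) as a consistency check.

```python
import math


def required_samples_exact(n, beta):
    """Larger root of beta*M^2 - (1+beta)(N+2)*M + 1 = 0, i.e. Eq. (22)."""
    b = (1.0 + beta) * (n + 2)
    return (b + math.sqrt(b * b - 4.0 * beta)) / (2.0 * beta)


def required_samples_approx(n, beta):
    """Small-beta approximation of Eq. (23)."""
    return (1.0 + 1.0 / beta) * (n + 2)


def relative_bias(m, n):
    """Left side of Eq. (21): E[d_M^2(X)] / N."""
    return (m + 1) / m * (m - 1) / (m - n - 2)


N, BETA = 6, 0.1
m_exact = required_samples_exact(N, BETA)     # about 87.9
m_approx = required_samples_approx(N, BETA)   # 88
samples_needed = math.ceil(m_exact)           # smallest integer M meeting the goal
```

For N = 6 features and a 10% tolerance the exact and approximate answers differ by only about 0.1 sample, so rounding up either one gives M = 88.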


In Fig. 1 we plot M versus N for various values of $\beta$. For a given value
of N this represents the minimum number of samples required to match the
design goal, i.e., that $E[d_M^2(X)]$ be within a certain percentage of
$E[d^2(X)]$. Note that for $\beta = 1$, where Eq. (23) gives $M = 2(N + 2)$, we
return to the Reed result (ii) that $M > 2N$.

III. APPLICATION TO A CERTAIN TYPE OF PATTERN CLASSIFIER

In a number of pattern classification schemes we might wish to refrain from
assigning a class label when the distance to each of the prototype clusters is
too large. If the distance squared from a point to be classified X to a
cluster $\omega_i$ is given by

$$d^2(X, \omega_i) = (X - \mu_i)^t \, S_i^{-1} \, (X - \mu_i) = d_M^2(X, \mu_i), \tag{24}$$

then we might refuse to classify the sample X if

$$d_M^2(X, \mu_i) > \theta \quad \forall \; i = 1, \ldots, L$$

with L = number of different classes and $\theta$ = some arbitrary threshold.
Alternatively we might consider the sample X to belong to the class labeled
"other".

[Fig. 1. Number of samples]


In this section we would like to consider how to choose the threshold $\theta$.
We do this by asking the question: If X were a member of class $\omega_i$, then
how far should we expect to find it from the cluster? The answer to this is
contained in Eq. (16). As an example, with M=100 samples and N=6 features we have

$$E[d_M^2(X)] = \left(\frac{M+1}{M}\right) \frac{N(M-1)}{M-N-2} = 6.52 \tag{25}$$

The one standard deviation interval on the mean is given by:

$$E[d_M^2(X)] \pm \sigma[d_M^2(X)] \tag{26a}$$

$$= \left(\frac{M+1}{M}\right) \left(\frac{M-1}{M-N-2}\right) \left[ N \pm \sqrt{\frac{2N(M-2)}{M-N-4}} \, \right] \tag{26b}$$

which for M=100, N=6 gives

$$6.52 \pm 3.93$$

Finally, using F-tables [4] we may find the 99% confidence interval for
$d_M^2(X)$ as the one-sided interval

$$P[0 \leq d_M^2(X) < d_\alpha^2] = 0.99$$

where

$$d_\alpha^2 = \left(\frac{M+1}{M}\right) \frac{(M-1)N}{M-N} \, F_{N,M-N}(\alpha), \qquad \alpha = 0.01$$

Again using M=100, N=6 we have

$$d_\alpha^2 = 6.38 \, F_{6,94}(0.01) = 19.156$$

Thus 99% of the time an independent random sample drawn from a population
characterized by 6 features and 100 samples will lie at a squared distance less
than 19.156 from the sample cluster. This number, 19.156, is independent of the
true mean $\mu$ and the true covariance matrix $\Sigma$; thus it is the same for
all clusters formed from 100 samples in a six-dimensional space. It is thus a
reasonable choice for the threshold $\theta$.
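The threshold and the rejection rule can be sketched together. In the code below the function names are assumptions, the per-cluster distances are made-up numbers, and the F critical value is passed in by the caller: 3.00 is used as a rounded stand-in for $F_{6,94}(0.01)$, which is why the result is close to, but not exactly, the 19.156 quoted above.

```python
def rejection_threshold(m, n, f_critical):
    """Scale an F critical value F_{N,M-N}(alpha) into a bound on d_M^2(X)."""
    return (m + 1) / m * (m - 1) * n / (m - n) * f_critical


def classify_with_reject(distances_squared, theta):
    """Return the index of the nearest cluster, or None ("other") if all exceed theta.

    distances_squared holds d_M^2(X, mu_i) for i = 0, ..., L-1.
    """
    best = min(range(len(distances_squared)), key=lambda i: distances_squared[i])
    return best if distances_squared[best] <= theta else None


# F_{6,94}(0.01) is taken here as 3.00, a rounded value assumed for illustration;
# in practice it would be read from F-tables or a statistics library.
theta = rejection_threshold(100, 6, 3.00)

# Hypothetical squared distances from one sample to L = 3 clusters.
label_near = classify_with_reject([25.0, 4.2, 31.7], theta)    # cluster 1 is close enough
label_far = classify_with_reject([25.0, 40.2, 31.7], theta)    # every cluster is too far
```

Because the threshold depends only on M, N and the significance level, the same value of `theta` can be shared by all clusters that were estimated from the same number of samples; clusters built from different sample counts would each need their own threshold.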

IV. SUMMARY

Using the F-distributed $T^2$ statistic we have determined a relation for the
number of samples M needed to characterize a multivariate Gaussian process with
dimensionality N. The characterization of the cluster has been in the useful
form of a distance metric which may be interpreted as either a signal-to-noise
ratio or a pattern classification rule. Further, in the pattern recognition
context, we have shown that the results can be interpreted as to how to set a
distance threshold $\theta$ beyond which all pattern class labels would be rejected.


REFERENCES

1. Reed, I. S., et al., "Rapid Convergence Rate in Adaptive Arrays," IEEE
   Trans. Aerospace Electron. Systems AES-10, 853-863 (1974).

2. Cover, T. M., "Geometrical and Statistical Properties of Linear Inequalities
   with Applications in Pattern Recognition," IEEE Trans. Electron. Computers
   EC-14, 326-334 (1965).

3. Foley, D. H., "Considerations of Sample and Feature Size," IEEE Trans.
   Inf. Theory IT-18, 618-626 (1972).

4. Burington, R. S. and May, D. C., Handbook of Probability and Statistics
   with Tables (Handbook Publishers, Sandusky, Ohio, 1953).

5. Anderson, T. W., An Introduction to Multivariate Statistical Analysis
   (Wiley, New York, 1958).

6. Fukunaga, K. and Kessell, D. L., "Error Evaluation and Model Validation
   in Statistical Pattern Recognition," Purdue University Electrical
   Engineering Technical Report TR-EE 72-23, Lafayette, Indiana (1972).


REPORT DOCUMENTATION PAGE

Title: Further Consideration of Sample and Feature Size
Type of Report: Technical Note
Performing Org. Report Number: Technical Note 1978-13
Author: Ian T. Young
Performing Organization: Lincoln Laboratory, M.I.T., P.O. Box 73,
    Lexington, MA 02173
Controlling Office: Air Force Systems Command, USAF, Andrews AFB,
    Washington, DC 20331
Report Date: 28 April 1978
Number of Pages: 18
Monitoring Agency: Electronic Systems Division, Hanscom AFB,
    Bedford, MA 01731
Security Class. (of this report): Unclassified
Distribution Statement (of this Report): Approved for public release;
    distribution unlimited.
Supplementary Notes: None
Key Words: correlation matrix; multivariate Gaussian process;
    signal-to-noise ratio


